nchypergeom_wallenius (Wallenius’ noncentral hypergeometric distribution)#
Wallenius’ noncentral hypergeometric distribution models biased sampling without replacement from a finite population with two types of objects.
This notebook uses the same parameterization as scipy.stats.nchypergeom_wallenius:
M = total population size (integer)
n = number of Type I objects in the population (integer)
N = number of draws (integer)
odds = odds ratio (\omega>0) (real)
Learning goals#
By the end you should be able to:
recognize when Wallenius’ noncentral hypergeometric model is appropriate
write down the PMF/CDF and understand the sampling mechanism
compute mean/variance/skewness/kurtosis and entropy numerically
implement NumPy-only sampling (sequential biased draws)
visualize PMF/CDF and validate with Monte Carlo simulation
use scipy.stats.nchypergeom_wallenius for computation and inference workflows
Table of contents#
Title & Classification
Intuition & Motivation
Formal Definition
Moments & Properties
Parameter Interpretation
Derivations
Sampling & Simulation
Visualization
SciPy Integration
Statistical Use Cases
Pitfalls
Summary
import math
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=6, suppress=True)
rng = np.random.default_rng(42)
1) Title & Classification#
Name: nchypergeom_wallenius (Wallenius’ noncentral hypergeometric distribution)
Type: Discrete
We follow the SciPy naming/notation:
M = total population size
n = number of Type I objects (so the Type II count is M - n)
N = number of draws (without replacement)
odds = (\omega), the odds ratio favoring Type I
Let (X) be the number of Type I objects drawn after N biased draws.
Support: [ x \in \{x_\ell, x_\ell+1, \dots, x_u\} ] where [ x_\ell = \max\bigl(0,\; N-(M-n)\bigr),\qquad x_u = \min(N, n). ]
Parameter space: [ M\in\mathbb{N},\qquad 0\le n\le M,\qquad 0\le N\le M,\qquad \omega>0. ]
Interpretation:
M, n describe the composition of the urn/population
N is how many draws you take
odds controls how strongly Type I is preferred ((\omega>1)) or discouraged ((\omega<1))
2) Intuition & Motivation#
The sampling story (biased urn)#
Imagine an urn with:
n red balls (Type I)
M-n blue balls (Type II)
You draw exactly N balls one-by-one without replacement, but the draw is biased:
if there are (r) red and (b) blue remaining, then
[ \mathbb{P}(\text{next draw is Type I}\mid r,b) = \frac{\omega r}{\omega r + b} ]
So (\omega) multiplies the attractiveness of Type I relative to Type II.
A key sanity check: if there is one ball of each color left, then [ \mathbb{P}(\text{Type I}) = \frac{\omega}{\omega+1}. ]
What this distribution models#
Preferential selection without replacement (finite population correction matters)
Biased sampling from two categories when the bias changes dynamically as items are removed
Typical real-world use cases#
Selection bias / preferential sampling: a subgroup is more likely to be selected, but you cannot select the same individual twice
Ecology / capture–recapture: heterogeneous catchability (Type I vs Type II) in a finite pool
Auditing / inspections: inspectors target one class of items more aggressively while sampling without replacement
Recommendation / ranking pipelines: drawing items sequentially with different propensities until a quota is met
Relations to other distributions#
Hypergeometric: when (\omega=1), the bias disappears and we recover the ordinary hypergeometric distribution.
Fisher’s noncentral hypergeometric (nchypergeom_fisher): another “noncentral hypergeometric” distribution with the same parameters but a different generative story.
Wallenius: biased sequential draws (odds applied at each draw)
Fisher: models a different conditioning story (“a handful of objects taken at once”)
Binomial approximation: if M is large and N is small (the sampling fraction (N/M) is tiny), the changing composition is negligible and
[ X \approx \mathrm{Binomial}\Bigl(N,\; p\Bigr),\qquad p = \frac{\omega\,(n/M)}{\omega\,(n/M) + (1-n/M)} = \frac{\omega n}{\omega n + (M-n)}. ]
That is: for small sampling fractions, “biased without replacement” is close to “biased with replacement”.
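A quick numerical check of this approximation (a minimal sketch; the parameter values below are illustrative, and binom is SciPy's ordinary binomial distribution):
# Sketch: compare the exact Wallenius PMF with its binomial approximation
# when the sampling fraction N/M is tiny (illustrative parameter values).
from scipy.stats import binom, nchypergeom_wallenius

M_big, n_big, N_small, w = 5000, 2000, 20, 1.5
p_approx = (w * n_big) / (w * n_big + (M_big - n_big))
xs_chk = np.arange(0, N_small + 1)
pmf_w = nchypergeom_wallenius.pmf(xs_chk, M_big, n_big, N_small, w)
pmf_b = binom.pmf(xs_chk, N_small, p_approx)
float(np.max(np.abs(pmf_w - pmf_b)))  # expected to be small for a 0.4% sampling fraction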
3) Formal Definition#
Generative definition (sequential without replacement)#
Let (R_0=n) and (B_0=M-n) be the initial counts (Type I and II remaining). For draws (t=1,2,\dots,N):
[ \mathbb{P}(\text{draw Type I at step }t\mid R_{t-1},B_{t-1}) = \frac{\omega R_{t-1}}{\omega R_{t-1}+B_{t-1}}. ]
After you draw a ball, you decrement the corresponding remaining count.
Let (X) be the total number of Type I objects drawn in the N draws.
PMF (integral representation)#
There is no simple closed-form PMF in elementary functions. A standard integral form is
[ \mathbb{P}(X=x) = \binom{n}{x}\,\binom{M-n}{N-x} \int_0^1 \left(1-t^{\omega/D}\right)^x\left(1-t^{1/D}\right)^{N-x}\,dt, \qquad x\in\{x_\ell,\dots,x_u\} ]
where
[ D = \omega(n-x) + \bigl((M-n) - (N-x)\bigr) = \omega(n-x) + (M-n-N+x). ]
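As a sanity check, the integral can be evaluated with ordinary numerical quadrature and compared against SciPy's PMF (a minimal sketch with small, illustrative parameter values; wallenius_pmf_via_integral is a throwaway helper, not a SciPy function):
# Sketch: evaluate the integral representation with scipy.integrate.quad
# and compare it to scipy.stats.nchypergeom_wallenius.pmf.
from math import comb
from scipy.integrate import quad
from scipy.stats import nchypergeom_wallenius

def wallenius_pmf_via_integral(x, M, n, N, odds):
    D = odds * (n - x) + (M - n - N + x)
    integrand = lambda t: (1.0 - t ** (odds / D)) ** x * (1.0 - t ** (1.0 / D)) ** (N - x)
    val, _ = quad(integrand, 0.0, 1.0)
    return comb(n, x) * comb(M - n, N - x) * val

M_, n_, N_, w_ = 30, 12, 10, 2.0
xs_ = np.arange(max(0, N_ - (M_ - n_)), min(N_, n_) + 1)
pmf_quad = np.array([wallenius_pmf_via_integral(int(x), M_, n_, N_, w_) for x in xs_])
pmf_scipy = nchypergeom_wallenius.pmf(xs_, M_, n_, N_, w_)
float(np.max(np.abs(pmf_quad - pmf_scipy)))  # expected to be small (limited by quadrature accuracy)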
CDF#
The CDF is defined by summing the PMF over the integer support:
[ F(k) = \mathbb{P}(X\le k) = \sum_{x=x_\ell}^{\lfloor k \rfloor} \mathbb{P}(X=x). ]
In practice (and in libraries), the PMF/CDF are computed using specialized numerical methods.
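For instance, summing the SciPy PMF over the support reproduces SciPy's own cdf values (a quick sketch; the parameters are illustrative):
# Sketch: the CDF is the running sum of the PMF over the integer support.
from scipy.stats import nchypergeom_wallenius

M_, n_, N_, w_ = 60, 25, 20, 1.5
xs_ = np.arange(max(0, N_ - (M_ - n_)), min(N_, n_) + 1)
cdf_from_sum = np.cumsum(nchypergeom_wallenius.pmf(xs_, M_, n_, N_, w_))
cdf_scipy = nchypergeom_wallenius.cdf(xs_, M_, n_, N_, w_)
float(np.max(np.abs(cdf_from_sum - cdf_scipy)))  # expected to be very small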
4) Moments & Properties#
Because the support is finite, all moments exist.
Mean, variance, skewness, kurtosis#
Given the PMF (p(x)=\mathbb{P}(X=x)), the standard definitions are:
[ \mu = \mathbb{E}[X] = \sum_x x\,p(x), \qquad \sigma^2 = \mathrm{Var}(X) = \sum_x (x-\mu)^2\,p(x). ]
Define central moments (\mu_r = \mathbb{E}[(X-\mu)^r]). Then
[ \text{skewness }\gamma_1 = \frac{\mu_3}{\sigma^3}, \qquad \text{excess kurtosis }\gamma_2 = \frac{\mu_4}{\sigma^4}-3. ]
For Wallenius’ distribution, these quantities generally have no simple closed form, but they are easy to compute numerically from the PMF.
MGF / characteristic function#
With the PMF on integer support,
[ M_X(t) = \mathbb{E}[e^{tX}] = \sum_x e^{tx}\,p(x), \qquad \varphi_X(t) = \mathbb{E}[e^{itX}] = \sum_x e^{itx}\,p(x). ]
Entropy#
The Shannon entropy (in nats) is
[ H(X) = -\sum_x p(x)\,\log p(x). ]
Symmetry by swapping types#
If you swap the meaning of Type I and Type II, the count transforms as (X \mapsto N-X), and the odds invert:
[ X\sim\text{Wallenius}(M,n,N,\omega) \quad\Longleftrightarrow\quad N-X\sim\text{Wallenius}(M,M-n,N,1/\omega). ]
(You can view this as “same biased sequential sampling, just relabeled”.)
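This identity is easy to verify numerically (a short sketch; the parameter values are illustrative):
# Sketch: pmf(x; M, n, N, odds) should equal pmf(N - x; M, M - n, N, 1/odds).
from scipy.stats import nchypergeom_wallenius

M_, n_, N_, w_ = 50, 20, 15, 2.5
xs_ = np.arange(max(0, N_ - (M_ - n_)), min(N_, n_) + 1)
pmf_lhs = nchypergeom_wallenius.pmf(xs_, M_, n_, N_, w_)
pmf_rhs = nchypergeom_wallenius.pmf(N_ - xs_, M_, M_ - n_, N_, 1.0 / w_)
float(np.max(np.abs(pmf_lhs - pmf_rhs)))  # expected to be tiny (numerical error only)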
def _validate_M_n_N_odds(M, n, N, odds):
if isinstance(M, bool) or not isinstance(M, (int, np.integer)):
raise TypeError("M must be an integer")
if isinstance(n, bool) or not isinstance(n, (int, np.integer)):
raise TypeError("n must be an integer")
if isinstance(N, bool) or not isinstance(N, (int, np.integer)):
raise TypeError("N must be an integer")
M = int(M)
n = int(n)
N = int(N)
odds = float(odds)
if M < 0:
raise ValueError("M must be >= 0")
if not (0 <= n <= M):
raise ValueError("n must satisfy 0 <= n <= M")
if not (0 <= N <= M):
raise ValueError("N must satisfy 0 <= N <= M")
if not (odds > 0):
raise ValueError("odds must be > 0")
return M, n, N, odds
def wallenius_support(M, n, N, odds=None):
M, n, N, _ = _validate_M_n_N_odds(M, n, N, odds if odds is not None else 1.0)
x_min = max(0, N - (M - n))
x_max = min(N, n)
return x_min, x_max
def sample_wallenius_numpy(M, n, N, odds, *, size=1, rng: np.random.Generator):
# NumPy-only sampler via sequential biased draws.
# Returns an integer array with shape `size` (like NumPy/SciPy rvs).
M, n, N, odds = _validate_M_n_N_odds(M, n, N, odds)
size = np.atleast_1d(size).astype(int)
if np.any(size < 0):
raise ValueError("size must be non-negative")
if N == 0 or n == 0:
return np.zeros(size, dtype=int)
if n == M:
return np.full(size, N, dtype=int)
# Remaining counts per Monte Carlo replicate.
r = np.full(size, n, dtype=int)
b = np.full(size, M - n, dtype=int)
# Sequentially draw N times.
for _ in range(N):
# Handle edge cases explicitly to avoid division warnings.
p = np.where(
r == 0,
0.0,
np.where(b == 0, 1.0, (odds * r) / (odds * r + b)),
)
u = rng.random(size)
draw_r = u < p
r = r - draw_r.astype(int)
b = b - (~draw_r).astype(int)
return n - r
def discrete_moments_from_pmf(xs, pmf):
# Compute mean/var/skew/excess-kurtosis/entropy from a discrete PMF.
xs = np.asarray(xs, dtype=float)
pmf = np.asarray(pmf, dtype=float)
total = pmf.sum()
if not np.isfinite(total) or total <= 0:
raise ValueError("PMF must sum to a positive finite value")
pmf = pmf / total
mean = float(np.sum(xs * pmf))
centered = xs - mean
var = float(np.sum(centered**2 * pmf))
if var == 0.0:
skew = 0.0
exkurt = -3.0
else:
mu3 = float(np.sum(centered**3 * pmf))
mu4 = float(np.sum(centered**4 * pmf))
skew = mu3 / (var ** 1.5)
exkurt = mu4 / (var**2) - 3.0
p = pmf[pmf > 0]
entropy_nats = float(-np.sum(p * np.log(p)))
return {
"mean": mean,
"var": var,
"skew": float(skew),
"exkurt": float(exkurt),
"entropy_nats": entropy_nats,
}
def mgf_from_pmf(t, xs, pmf):
xs = np.asarray(xs, dtype=float)
pmf = np.asarray(pmf, dtype=float)
pmf = pmf / pmf.sum()
return float(np.sum(np.exp(t * xs) * pmf))
def cf_from_pmf(t, xs, pmf):
xs = np.asarray(xs, dtype=float)
pmf = np.asarray(pmf, dtype=float)
pmf = pmf / pmf.sum()
return complex(np.sum(np.exp(1j * t * xs) * pmf))
def wallenius_mean_field_mean(M, n, N, odds, *, tol=1e-12, max_iter=200):
# Approximate E[X] by a mean-field ODE argument.
#
# This solves for mu in:
# (1 - mu/n)^(1/odds) = 1 - (N - mu)/(M - n)
# within the feasible interval [x_min, x_max].
M, n, N, odds = _validate_M_n_N_odds(M, n, N, odds)
x_min, x_max = wallenius_support(M, n, N, odds)
if N == 0 or n == 0:
return 0.0
if n == M:
return float(N)
m2 = M - n
def f(mu):
mu = float(mu)
left_base = max(0.0, 1.0 - mu / n)
right = 1.0 - (N - mu) / m2
if left_base == 0.0:
left = 0.0
else:
left = math.exp((1.0 / odds) * math.log(left_base))
return left - right
lo, hi = float(x_min), float(x_max)
if lo == hi:
return lo
f_lo, f_hi = f(lo), f(hi)
if f_lo == 0.0:
return lo
if f_hi == 0.0:
return hi
# In extreme cases, numerical bracket might not contain a sign change;
# fall back to an endpoint that matches the direction of bias.
if f_lo * f_hi > 0:
return hi if odds > 1 else lo
for _ in range(max_iter):
mid = 0.5 * (lo + hi)
f_mid = f(mid)
if abs(f_mid) < tol or (hi - lo) < tol:
return mid
if f_lo * f_mid <= 0:
hi, f_hi = mid, f_mid
else:
lo, f_lo = mid, f_mid
return 0.5 * (lo + hi)
from scipy.stats import hypergeom, nchypergeom_wallenius
M, n, N, odds = 140, 80, 60, 0.5
x_min, x_max = wallenius_support(M, n, N, odds)
xs = np.arange(x_min, x_max + 1)
pmf = nchypergeom_wallenius.pmf(xs, M, n, N, odds)
# Sanity check: odds=1 reduces to the ordinary hypergeometric distribution.
pmf_unbiased = nchypergeom_wallenius.pmf(xs, M, n, N, 1.0)
pmf_h = hypergeom.pmf(xs, M, n, N)
max_abs_diff_odds1 = float(np.max(np.abs(pmf_unbiased - pmf_h)))
mom_from_pmf = discrete_moments_from_pmf(xs, pmf)
mean_scipy, var_scipy, skew_scipy, exkurt_scipy = nchypergeom_wallenius.stats(
M, n, N, odds, moments="mvsk"
)
# Monte Carlo moments using the NumPy-only sampler
mc = sample_wallenius_numpy(M, n, N, odds, size=80_000, rng=rng)
{
"support": [int(x_min), int(x_max)],
"pmf_sum": float(pmf.sum()),
"odds1_matches_hypergeom_max_abs_diff": max_abs_diff_odds1,
"mean_pmf": mom_from_pmf["mean"],
"mean_scipy": float(mean_scipy),
"mean_mean_field": wallenius_mean_field_mean(M, n, N, odds),
"mean_mc": float(mc.mean()),
"var_pmf": mom_from_pmf["var"],
"var_scipy": float(var_scipy),
"var_mc": float(mc.var(ddof=0)),
"skew_scipy": float(skew_scipy),
"exkurt_scipy": float(exkurt_scipy),
"entropy_scipy": float(nchypergeom_wallenius.entropy(M, n, N, odds)),
"mgf_t=0.1": mgf_from_pmf(0.1, xs, pmf),
"cf_t=1.0": cf_from_pmf(1.0, xs, pmf),
}
{'support': [0, 60],
'pmf_sum': 0.9999999999759758,
'odds1_matches_hypergeom_max_abs_diff': 1.1518563880485999e-14,
'mean_pmf': 26.649038741000833,
'mean_scipy': 26.64903874100567,
'mean_mean_field': 26.666666666660603,
'mean_mc': 26.6575,
'var_pmf': 8.208241763338343,
'var_scipy': 8.208241761190717,
'var_mc': 8.199268749999998,
'skew_scipy': 0.035671995651632836,
'exkurt_scipy': -0.014465480478890846,
'entropy_scipy': 2.471394837792706,
'mgf_t=0.1': 14.970488004642451,
'cf_t=1.0': (0.003122019680244857+0.015529111336782947j)}
5) Parameter Interpretation#
M and n (population composition)#
M sets the overall population size.
n/M is the baseline fraction of Type I objects.
When N is not tiny relative to M, the fact that we sample without replacement matters: after you draw some Type I objects, there are fewer Type I left, so Type I becomes harder to draw later.
N (number of draws)#
Increasing N expands the support and shifts the mean upward; the variance also grows at first, but it shrinks again as N approaches M (at N = M every object is drawn and X = n with certainty).
odds (bias strength)#
odds = 1 means unbiased sampling (hypergeometric).
odds > 1 favors Type I, shifting mass to larger x.
odds < 1 discourages Type I, shifting mass to smaller x.
In the extreme limits:
(\omega\to\infty): you draw as many Type I objects as possible, so (X\to\min(n,N)).
(\omega\to 0): you avoid Type I until forced, so (X\to \max(0, N-(M-n))).
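The sketch below illustrates both limits with the NumPy sampler defined earlier (the odds values 1e4 and 1e-4 stand in for the limits; all numbers are illustrative):
# Sketch: extreme odds push essentially all mass to the support endpoints.
M_, n_, N_ = 80, 30, 25
x_lo, x_hi = wallenius_support(M_, n_, N_, 1.0)
x_strong = sample_wallenius_numpy(M_, n_, N_, 1e4, size=5_000, rng=rng)
x_weak = sample_wallenius_numpy(M_, n_, N_, 1e-4, size=5_000, rng=rng)
{
    "support": (x_lo, x_hi),
    "frac_at_upper_endpoint_odds_1e4": float(np.mean(x_strong == x_hi)),
    "frac_at_lower_endpoint_odds_1e-4": float(np.mean(x_weak == x_lo)),
}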
from plotly.subplots import make_subplots
M, n, N = 80, 30, 25
odds_values = [0.2, 0.5, 1.0, 2.0, 5.0]
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=(
f"PMF vs odds (M={M}, n={n}, N={N})",
f"PMF vs N (M={M}, n={n}, odds=2.0)",
),
)
x_min, x_max = wallenius_support(M, n, N, odds_values[0])
xs = np.arange(x_min, x_max + 1)
for w in odds_values:
pmf = nchypergeom_wallenius.pmf(xs, M, n, N, w)
fig.add_trace(
go.Scatter(x=xs, y=pmf, mode="lines+markers", name=f"odds={w}"),
row=1,
col=1,
)
odds_fixed = 2.0
N_values = [5, 15, 25, 40]
for N_ in N_values:
x_min, x_max = wallenius_support(M, n, N_, odds_fixed)
xs_ = np.arange(x_min, x_max + 1)
pmf_ = nchypergeom_wallenius.pmf(xs_, M, n, N_, odds_fixed)
fig.add_trace(
go.Scatter(x=xs_, y=pmf_, mode="lines+markers", name=f"N={N_}"),
row=1,
col=2,
)
fig.update_layout(
height=420,
width=980,
legend=dict(orientation="h", yanchor="bottom", y=-0.25, xanchor="left", x=0),
)
fig.update_xaxes(title_text="x", row=1, col=1)
fig.update_xaxes(title_text="x", row=1, col=2)
fig.update_yaxes(title_text="P(X=x)", row=1, col=1)
fig.update_yaxes(title_text="P(X=x)", row=1, col=2)
fig.show()
6) Derivations#
Wallenius’ distribution is defined by a sequential, state-dependent sampling process, so closed-form algebraic expressions are rare.
We’ll do three things:
derive a useful mean-field approximation for (\mathbb{E}[X])
show how variance follows from the PMF (numerically)
write down the likelihood for (\omega) and fit it numerically
Expectation (mean-field / ODE derivation)#
Let (a(t)) and (b(t)) be the expected numbers of Type I and II remaining after (t) draws. Using the sequential sampling rule,
[ \mathbb{E}[\Delta a] \approx -\frac{\omega a}{\omega a + b}, \qquad \mathbb{E}[\Delta b] \approx -\frac{b}{\omega a + b}. ]
Treating (t) as continuous gives the ODE approximation
[ \frac{da}{dt} = -\frac{\omega a}{\omega a + b},\qquad \frac{db}{dt} = -\frac{b}{\omega a + b}. ]
Take the ratio:
[ \frac{db}{da} = \frac{db/dt}{da/dt} = \frac{b}{\omega a} \quad\Rightarrow\quad \frac{db}{b} = \frac{1}{\omega}\,\frac{da}{a}. ]
Integrating:
[ \log b = \frac{1}{\omega}\log a + C \quad\Rightarrow\quad b = C\,a^{1/\omega}. ]
Use the initial condition (a(0)=n), (b(0)=M-n) to get
[ \frac{b}{M-n} = \left(\frac{a}{n}\right)^{1/\omega}. ]
After (N) draws, (a(N)=n-\mu) and (b(N)=(M-n)-(N-\mu)) where (\mu=\mathbb{E}[X]). Plugging in yields
[ 1-\frac{N-\mu}{M-n} = \left(1-\frac{\mu}{n}\right)^{1/\omega}. ]
This scalar equation has a unique solution in the feasible interval ([x_\ell, x_u]) and can be solved with bisection.
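The helper wallenius_mean_field_mean defined above implements exactly this bisection; the sketch below compares it with the exact mean from SciPy for a few illustrative odds values:
# Sketch: mean-field approximation vs the exact mean across a few odds values.
from scipy.stats import nchypergeom_wallenius

M_, n_, N_ = 140, 80, 60
rows_mf = []
for w_ in [0.25, 0.5, 1.0, 2.0, 4.0]:
    rows_mf.append(
        {
            "odds": w_,
            "mean_field": wallenius_mean_field_mean(M_, n_, N_, w_),
            "mean_exact": float(nchypergeom_wallenius.mean(M_, n_, N_, w_)),
        }
    )
rows_mf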
Variance#
Once you can compute the PMF numerically, variance is
[ \mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \sum_x x^2 p(x) - \left(\sum_x x p(x)\right)^2. ]
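A quick check of this identity against SciPy's var (a sketch with illustrative parameters):
# Sketch: Var(X) = E[X^2] - (E[X])^2 computed directly from the PMF.
from scipy.stats import nchypergeom_wallenius

M_, n_, N_, w_ = 120, 50, 40, 2.0
xs_ = np.arange(max(0, N_ - (M_ - n_)), min(N_, n_) + 1)
pmf_ = nchypergeom_wallenius.pmf(xs_, M_, n_, N_, w_)
mean_ = float(np.sum(xs_ * pmf_))
var_ = float(np.sum(xs_**2 * pmf_) - mean_**2)
{"var_from_pmf": var_, "var_scipy": float(nchypergeom_wallenius.var(M_, n_, N_, w_))}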
Likelihood for (\omega)#
If you observe a single count (x), the likelihood for (\omega) (with M,n,N known) is
[ L(\omega\mid x) = p(x\mid M,n,N,\omega),\qquad \ell(\omega)=\log L(\omega\mid x). ]
For independent observations (x_1,\dots,x_m) with the same parameters, the log-likelihood sums:
[ \ell(\omega) = \sum_{i=1}^m \log p(x_i\mid M,n,N,\omega). ]
Because the PMF is evaluated numerically, the MLE (\hat\omega) is typically found with 1D numerical optimization.
from scipy.optimize import minimize_scalar
M, n, N, odds_true = 120, 50, 40, 2.0
x_obs = int(nchypergeom_wallenius.rvs(M, n, N, odds_true, random_state=rng))
odds_grid = np.logspace(-2, 2, 600)
logL = nchypergeom_wallenius.logpmf(x_obs, M, n, N, odds_grid)
# MLE for odds (bounded 1-D search over a wide interval)
def nll(odds):
return -float(nchypergeom_wallenius.logpmf(x_obs, M, n, N, odds))
res = minimize_scalar(nll, bounds=(1e-3, 1e3), method="bounded")
odds_hat = float(res.x)
fig = go.Figure()
fig.add_trace(go.Scatter(x=odds_grid, y=logL, mode="lines", name="log-likelihood"))
fig.add_vline(x=1.0, line_dash="dash", line_color="gray", annotation_text="odds=1")
fig.add_vline(x=odds_hat, line_dash="dash", line_color="black", annotation_text="MLE")
fig.update_layout(
title=f"Log-likelihood for odds (x_obs={x_obs}, M={M}, n={n}, N={N})",
xaxis_title="odds",
yaxis_title="log L(odds)",
xaxis_type="log",
)
fig.show()
{
"x_obs": x_obs,
"odds_true": odds_true,
"odds_mle": odds_hat,
"opt_success": bool(res.success),
"mean_field_mean_at_true": wallenius_mean_field_mean(M, n, N, odds_true),
}
{'x_obs': 20,
'odds_true': 2.0,
'odds_mle': 999.9999700467109,
'opt_success': True,
'mean_field_mean_at_true': 22.197826063638786}
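Note that the reported odds_mle coincides with the upper search bound (1e3), which usually signals that the bounded search was derailed (for example by non-finite logpmf values at extreme odds) rather than a genuine maximum. A more defensive sketch, assuming a narrower, plausible search range of (0.05, 20) and an explicit finiteness guard (both are illustrative choices, not part of SciPy's API):
# Sketch: guarded 1-D search over a narrower odds range (illustrative assumptions).
def nll_guarded(odds):
    lp = nchypergeom_wallenius.logpmf(x_obs, M, n, N, odds)
    return -float(lp) if np.isfinite(lp) else float("inf")

res_guarded = minimize_scalar(nll_guarded, bounds=(0.05, 20.0), method="bounded")
{"odds_mle_guarded": float(res_guarded.x), "nll_at_mle": float(res_guarded.fun)}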
7) Sampling & Simulation#
A direct NumPy-only sampling algorithm follows the generative story:
Start with r = n Type I and b = M-n Type II remaining.
Repeat N times:
compute (p = \omega r/(\omega r + b))
draw (U\sim\text{Uniform}(0,1))
if (U<p), record a Type I draw and decrement r; else decrement b
Return the total Type I draws.
This is exact but costs (O(N\cdot\text{size})). For many settings (moderate N), it’s fast and simple.
SciPy’s rvs uses a specialized, optimized implementation (BiasedUrn).
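A rough wall-clock comparison of the two samplers (a sketch; absolute timings depend on the machine, and the parameter values below are illustrative):
# Sketch: rough timing of the NumPy loop sampler vs SciPy's rvs.
import time

M_, n_, N_, w_, size_ = 200, 60, 40, 1.7, 50_000
t0 = time.perf_counter()
_ = sample_wallenius_numpy(M_, n_, N_, w_, size=size_, rng=rng)
t1 = time.perf_counter()
_ = nchypergeom_wallenius.rvs(M_, n_, N_, w_, size=size_, random_state=rng)
t2 = time.perf_counter()
{"numpy_loop_seconds": t1 - t0, "scipy_rvs_seconds": t2 - t1}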
M, n, N, odds = 200, 60, 40, 1.7
size = 50_000
x_np = sample_wallenius_numpy(M, n, N, odds, size=size, rng=rng)
x_sp = nchypergeom_wallenius.rvs(M, n, N, odds, size=size, random_state=rng)
{
"np_mean": float(x_np.mean()),
"scipy_mean": float(x_sp.mean()),
"scipy_theory_mean": float(nchypergeom_wallenius.mean(M, n, N, odds)),
"np_var": float(x_np.var(ddof=0)),
"scipy_var": float(x_sp.var(ddof=0)),
}
{'np_mean': 16.27202,
'scipy_mean': 16.25174,
'scipy_theory_mean': 16.265095957990887,
'np_var': 7.4177051196,
'scipy_var': 7.4803669724}
8) Visualization#
We’ll visualize:
the PMF on the integer support
the CDF (as a step function)
Monte Carlo samples vs the exact PMF
M, n, N, odds = 140, 80, 60, 0.5
x_min, x_max = wallenius_support(M, n, N, odds)
xs = np.arange(x_min, x_max + 1)
pmf = nchypergeom_wallenius.pmf(xs, M, n, N, odds)
cdf = np.cumsum(pmf)
fig_pmf = go.Figure()
fig_pmf.add_trace(go.Bar(x=xs, y=pmf, name="PMF"))
fig_pmf.update_layout(
title=f"Wallenius PMF (M={M}, n={n}, N={N}, odds={odds})",
xaxis_title="x",
yaxis_title="P(X=x)",
)
fig_pmf.show()
fig_cdf = go.Figure()
fig_cdf.add_trace(go.Scatter(x=xs, y=cdf, mode="lines", line_shape="hv", name="CDF"))
fig_cdf.update_layout(
title=f"Wallenius CDF (M={M}, n={n}, N={N}, odds={odds})",
xaxis_title="x",
yaxis_title="P(X≤x)",
)
fig_cdf.show()
# Monte Carlo comparison
mc = sample_wallenius_numpy(M, n, N, odds, size=120_000, rng=rng)
counts = np.bincount(mc - x_min, minlength=(x_max - x_min + 1))
pmf_hat = counts / counts.sum()
fig_mc = go.Figure()
fig_mc.add_trace(go.Bar(x=xs, y=pmf_hat, name="Monte Carlo", opacity=0.6))
fig_mc.add_trace(go.Scatter(x=xs, y=pmf, mode="lines+markers", name="Exact (SciPy)"))
fig_mc.update_layout(
title="Monte Carlo vs exact PMF",
xaxis_title="x",
yaxis_title="Probability",
)
fig_mc.show()
9) SciPy Integration#
SciPy provides scipy.stats.nchypergeom_wallenius:
pmf, logpmf
cdf, sf (survival function), ppf
rvs
stats, mean, var, entropy
About .fit#
nchypergeom_wallenius does not provide a .fit() method (SciPy’s generic .fit is only implemented for continuous distributions).
In practice, you usually know M,n,N from the study design and only want to estimate the bias parameter odds.
We’ll show a simple custom MLE for odds using logpmf.
M, n, N, odds = 140, 80, 60, 0.5
rv = nchypergeom_wallenius(M, n, N, odds)
x_min, x_max = wallenius_support(M, n, N, odds)
xs = np.arange(x_min, x_max + 1)
pmf = rv.pmf(xs)
cdf = rv.cdf(xs)
samples = rv.rvs(size=20_000, random_state=rng)
{
"pmf_sum": float(pmf.sum()),
"cdf_last": float(cdf[-1]),
"sample_mean": float(samples.mean()),
"theory_mean": float(rv.mean()),
"theory_var": float(rv.var()),
"entropy": float(rv.entropy()),
}
{'pmf_sum': 0.9999999999759758,
'cdf_last': 1.0,
'sample_mean': 26.67215,
'theory_mean': 26.64903874100567,
'theory_var': 8.208241761190717,
'entropy': 2.471394837792706}
from scipy.optimize import minimize_scalar
# Custom 1D MLE for odds given data and known (M,n,N)
M, n, N, odds_true = 120, 50, 40, 1.8
x = nchypergeom_wallenius.rvs(M, n, N, odds_true, size=2_000, random_state=rng)
# Negative log-likelihood for odds (vectorized over data)
def nll_odds(odds):
if not (odds > 0):
return float("inf")
return -float(nchypergeom_wallenius.logpmf(x, M, n, N, odds).sum())
res = minimize_scalar(nll_odds, bounds=(1e-3, 1e3), method="bounded")
{
"odds_true": odds_true,
"odds_mle": float(res.x),
"opt_success": bool(res.success),
}
{'odds_true': 1.8, 'odds_mle': 999.9999700467109, 'opt_success': True}
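Here too the optimizer landed exactly on the upper search bound, which points to a numerical problem (likely non-finite logpmf values at extreme odds) rather than a real maximum. A guarded variant of the negative log-likelihood (the narrower bounds and the finiteness check are illustrative assumptions):
# Sketch: guard against non-finite logpmf values and search a narrower range.
def nll_odds_guarded(odds):
    lp = nchypergeom_wallenius.logpmf(x, M, n, N, odds)
    if not np.all(np.isfinite(lp)):
        return float("inf")
    return -float(lp.sum())

res_guarded = minimize_scalar(nll_odds_guarded, bounds=(0.05, 20.0), method="bounded")
{"odds_true": odds_true, "odds_mle_guarded": float(res_guarded.x)}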
10) Statistical Use Cases#
A) Hypothesis testing (bias vs no bias)#
A common question: is selection unbiased? In this setting, “unbiased” corresponds to (\omega=1) and the distribution reduces to the ordinary hypergeometric.
Given an observed (x_\text{obs}), one-sided p-values are computed with the survival function:
[ \text{p-value} = \mathbb{P}(X\ge x_\text{obs}\mid \omega=1) = \mathrm{sf}(x_\text{obs}-1). ]
You can also compare (\omega=1) vs (\omega\ne 1) using a likelihood ratio test (asymptotic (\chi^2_1) calibration).
B) Bayesian modeling (posterior over odds)#
Put a prior on (\log\omega) (e.g., Normal) and compute the posterior on a grid:
[ \log p(\omega\mid x) = \log p(x\mid \omega) + \log p(\omega) + \text{const}. ]
Because the parameter is 1D, grid Bayes is often perfectly adequate.
C) Generative modeling#
In synthetic-data pipelines, Wallenius’ distribution is a handy component when you need to simulate quota-based biased selection without replacement.
from scipy.stats import chi2
from scipy.optimize import minimize_scalar
# Example: test enrichment under the unbiased model (odds=1)
M, n, N = 200, 60, 40
x_obs = 18
p_ge = nchypergeom_wallenius.sf(x_obs - 1, M, n, N, 1.0)
# Likelihood ratio test: H0 odds=1 vs H1 odds free
def nll1(odds):
return -float(nchypergeom_wallenius.logpmf(x_obs, M, n, N, odds))
res = minimize_scalar(nll1, bounds=(1e-3, 1e3), method="bounded")
ll_hat = -float(res.fun)
ll_null = float(nchypergeom_wallenius.logpmf(x_obs, M, n, N, 1.0))
LR = 2 * (ll_hat - ll_null)
p_lr = float(chi2.sf(LR, df=1))
{
"x_obs": x_obs,
"p_value_one_sided_ge": float(p_ge),
"odds_mle": float(res.x),
"LR_stat": float(LR),
"p_value_LR_chi2_approx": p_lr,
}
{'x_obs': 18,
'p_value_one_sided_ge': 0.018665484494152507,
'odds_mle': 999.9999700467109,
'LR_stat': -inf,
'p_value_LR_chi2_approx': 1.0}
# Simple grid Bayes for odds (prior on log-odds)
M, n, N = 200, 60, 40
x_obs = 18
log_odds_grid = np.linspace(-3.0, 3.0, 600)
odds_grid = np.exp(log_odds_grid)
# Prior: log(odds) ~ Normal(0, 1.0) (up to an additive constant)
log_prior = -0.5 * (log_odds_grid / 1.0) ** 2
log_like = nchypergeom_wallenius.logpmf(x_obs, M, n, N, odds_grid)
log_post_unnorm = log_like + log_prior
log_post_unnorm -= np.max(log_post_unnorm)
post_log = np.exp(log_post_unnorm)
post_log /= np.trapz(post_log, log_odds_grid) # density w.r.t log-odds
fig = go.Figure()
fig.add_trace(go.Scatter(x=odds_grid, y=post_log, mode="lines", name="posterior"))
fig.add_vline(x=1.0, line_dash="dash", line_color="gray")
fig.update_layout(
title=f"Posterior over odds (x_obs={x_obs}, M={M}, n={n}, N={N})",
xaxis_title="odds",
yaxis_title="density (w.r.t log-odds)",
xaxis_type="log",
)
fig.show()
# Posterior mean and a central 90% credible interval (in log-odds space)
dlog = log_odds_grid[1] - log_odds_grid[0]
cdf = np.cumsum(post_log) * dlog
cdf = cdf / cdf[-1]
lo_log = float(np.interp(0.05, cdf, log_odds_grid))
hi_log = float(np.interp(0.95, cdf, log_odds_grid))
{
"posterior_mean_odds": float(np.trapz(np.exp(log_odds_grid) * post_log, log_odds_grid)),
"credible_interval_90_odds": [float(np.exp(lo_log)), float(np.exp(hi_log))],
}
{'posterior_mean_odds': 2.0282392492127608,
'credible_interval_90_odds': [1.1560453396623622, 3.184447627708847]}
# Generative modeling: simulate biased selection outcomes for different odds
M, n, N = 150, 50, 35
odds_list = [0.5, 1.0, 2.0]
size = 30_000
rows = []
for w in odds_list:
x = sample_wallenius_numpy(M, n, N, w, size=size, rng=rng)
rows.append(
{
"odds": w,
"mean": float(x.mean()),
"var": float(x.var(ddof=0)),
}
)
rows
[{'odds': 0.5, 'mean': 7.433333333333334, 'var': 4.846355555555555},
{'odds': 1.0, 'mean': 11.633933333333333, 'var': 5.991661862222221},
{'odds': 2.0, 'mean': 16.652333333333335, 'var': 6.408194555555555}]
11) Pitfalls#
Parameter constraints: M, n, N must be integers with 0 ≤ n ≤ M and 0 ≤ N ≤ M; odds > 0.
Notation clashes: different sources swap the meaning of N, n, M. Always check the API.
Degenerate limits:
very large odds concentrates mass near (\min(n,N))
very small odds concentrates mass near (\max(0, N-(M-n)))
Numerical stability: PMF values can be tiny for large populations; use logpmf when optimizing likelihoods.
Performance:
sequential NumPy sampling is (O(N\cdot\text{size}))
PMF evaluation is nontrivial; SciPy relies on the BiasedUrn library for efficiency/accuracy.
Identifiability: from counts alone you typically cannot estimate M, n, N, odds simultaneously; in real studies M, n, N are usually known.
12) Summary#
nchypergeom_wallenius is a discrete model for biased sequential sampling without replacement from two types.
Support is integer and bounded: (x\in\{\max(0,N-(M-n)),\dots,\min(N,n)\}).
(\omega=1) recovers the ordinary hypergeometric; (\omega>1) favors Type I.
PMF/CDF and moments are usually handled numerically (SciPy uses BiasedUrn).
Exact sampling is simple to implement via sequential biased draws (NumPy-only), and Monte Carlo checks are straightforward.